AUTOMATIC PROBABILISTIC KNOWLEDGE
ACQUISITION FROM DATA 

William B. Gevarter 


ABSTRACT 

This memorandum documents an outline for a computer program for extracting significant 
correlations of attributes from masses of data. This information can then be used to develop a 
knowledge base for a probabilistic “expert system.” The method determines the “best” estimate 
of joint probabilities of attributes from data put into contingency table form. A major output 
from the program is a general formula for calculating any probability relation associated with 
the data. These probability relations can be utilized to form IF-THEN rules with associated 
probability, useful for expert systems. 

INTRODUCTION 

Knowledge acquisition is a major bottleneck in developing “expert systems.” Thus a recent 
focus of the artificial intelligence (AI) community has been on “machine learning.” Though 
this has been a theme in AI for several decades, it has only been in the last few years, spurred 
on by the popularity of expert systems, that machine learning has received major attention. 

Because of the emergence of sophisticated expert system building tools such as KEE (In- 
tellicorp, 1985) and ART (Williams, 1985), and a host of follow-on simpler systems, the main 
difficulty in building conventional expert systems has now shifted to knowledge acquisition 
and choosing the most appropriate knowledge structures and representations. Thus knowledge 
acquisition is an important and timely research area for NASA to investigate. 

Approaches to knowledge acquisition have included psychological techniques for interviewing 
experts (Boose, 1984 and Kahn et al., 1985) and automatic production of classification-oriented 
expert systems from examples as exemplified by the TIMM (General Research Corp., 1985) 
expert system building tool. Thus far, such approaches have had only a limited range of 
successful applications. Therefore, methods of knowledge acquisition and knowledge extraction 
from data are important current AI research topics. Knowledge acquisition can be construed 
as learning. 

AI learning systems can be classified according to the following strategies: 

1. Rote learning 

2. Learning from instruction 

3. Learning by analogy

4. Learning by examples

5. Learning from observation and discovery 

The last learning strategy is perhaps the most recent and the most exciting. This strat- 
egy is exemplified by Lenat’s (1982) AM and EURISKO systems and Langley et al. (1983) 
BACON system. These discovery systems are heavily knowledge-based. The usual focus of 
discovery systems is on discovering concepts. Cheeseman (1984) has explored another facet, 
that of developing specific correlations from data. His approach is probabilistic in nature and is 
primarily procedurally (syntactically) oriented. This approach is particularly appropriate when 
the source of the knowledge is in the form of large masses of undigested data, such as those ob- 
tained from wind tunnel tests; spacecraft observations; computer simulations; or psychological, 
medical, and social surveys. 

Commercial AI learning systems such as Expert-Ease (Derfler, 1985) and TIMM are aimed 
at developing decision aids from examples. In general, learning from examples is predicated 
on positive examples, which promote generalization, and negative examples, which reduce
generalization. However, commercial learning tools are not designed to extract significant 
information from data for which no conclusions have yet been reached. 

Many researchers believe that truly powerful intelligent systems will be difficult to achieve 
without a learning component because of the huge amount of knowledge many future expert sys- 
tems will require, as well as the need to improve performance by learning new search heuristics 
as the system is used. 

This memorandum outlines an approach to probabilistically determining significant rela- 
tionships in masses of data. This can be particularly important because NASA has masses of 
unevaluated data from its space explorations. Automatic means to find significant correlations 
in these data can begin to reduce this mammoth NASA reserve data bank. The approach 
outlined in this paper draws on previous work by Cheeseman (1983, 1984) in this area. Using 
this approach, the resultant information, probabilistically extracted from the data, allows cal- 
culations of the conditional probability of any proposition associated with the data, given any 
combination of evidence. This information can be used as clues for discovering more causal ex- 
planations. The probabilistically extracted information can also be transformed into IF-THEN 
(“condition-conclusion”) rules (with associated probability) found useful in expert systems. For
example, the probability of A given B and C is p, written as

P(A | B, C) = p

can also be written as

IF B AND C, THEN A (with probability p)
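The rule form above can be sketched in a few lines of Python. This is only an illustration, not the memorandum's program: it uses the raw counts of the smoking example tabulated later in Figure 1 (rather than the maximum entropy estimates developed below), and the helper names `joint_count` and `conditional` are hypothetical.

```python
# Counts N[(i, j, k)] copied from the example in Figure 1 of this memorandum:
# i = A (smoking history, 1-3), j = B (cancer, 1 = yes, 2 = no),
# k = C (family history of cancer, 1 = yes, 2 = no).
COUNTS = {
    (1, 1, 1): 130, (1, 2, 1): 410, (1, 1, 2): 110, (1, 2, 2): 640,
    (2, 1, 1): 62,  (2, 2, 1): 580, (2, 1, 2): 31,  (2, 2, 2): 460,
    (3, 1, 1): 78,  (3, 2, 1): 520, (3, 1, 2): 22,  (3, 2, 2): 385,
}


def joint_count(counts, fixed):
    """Sum the cells whose attribute values agree with `fixed` ({axis: value})."""
    return sum(n for cell, n in counts.items()
               if all(cell[axis] == value for axis, value in fixed.items()))


def conditional(counts, target, given):
    """P(target | given) as a ratio of joint counts (the total N cancels)."""
    return joint_count(counts, {**given, **target}) / joint_count(counts, given)


# Illustrative rule: IF smoker (A = 1) AND family history (C = 1), THEN cancer (B = 1).
p = conditional(COUNTS, target={1: 1}, given={0: 1, 2: 1})
rule = f"IF A=1 AND C=1, THEN B=1 (with probability {p:.2f})"
print(rule)
```

Axes are numbered 0, 1, 2 for A, B, C, so any combination of evidence can be passed in `given` without changing the helper.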

The system described in this memorandum does not generate rules explicitly. It generates 
and stores significant joint probabilities instead. Particular conditional probabilities can be 
calculated from this information as required by noting that a conditional probability can be 
written as the ratio of corresponding joint probabilities. Thus, for example,

P(A | B, C) = P(A, B, C) / P(B, C)



PROBLEM DEFINITION 


The problem explored in this paper is that of extracting information from data which can 
then be used to form the knowledge base of a probabilistic expert system. The resultant for- 
mulation summarizes all the probabilistic information found from the data and can be used 
to calculate any probability associated with the data. The information found is the significant 
joint probabilities of attributes from data (which has resulted from a collection of observations). 
An illustrative example followed in this paper determines the probabilistic relationships of can- 
cer to smoking given a set of observations on people over the age of 60 whose hypothetical case 
histories are obtained from the completion of the following questionnaire:


A. SMOKING HISTORY 

1. Smoker 

2. Non smoker not married to a smoker 

3. Non smoker married to a smoker 

B. CANCER 

1. Yes 

2. No 

C. FAMILY HISTORY OF CANCER 

1. Yes 

2. No 

A set of data thus obtained from a survey of 3428 individuals might appear as shown in 
Figures 1a and 1b (called “contingency tables”¹).

The numbers shown in each box or cell refer to the total number of individuals who have 
that combination of attributes. Thus, the number of smokers who do not have cancer despite 
a family history of cancer is given as 410. This can be written in a shorthand notation as 

N^{ABC}_{121} = 410

This states that there were 410 individuals that had the attribute values 

A (SMOKING) = 1 (Smoker) 

B (CANCER) = 2 (No) 

C (FAMILY HISTORY OF CANCER) = 1 (Yes)


¹ Appendix A indicates how original data can be put into contingency table form.

                                         B  CANCER
    A  SMOKING                           1. Yes   2. No
    1. Smoker                              130     410
    2. Non smoker                           62     580
    3. Non smoker married to a smoker       78     520

    C  FAMILY HISTORY OF CANCER: 1. Yes
    (a)

                                         B  CANCER
    A  SMOKING                           1. Yes   2. No
    1.                                     110     640
    2.                                      31     460
    3.                                      22     385

    C  FAMILY HISTORY OF CANCER: 2. No
    (b)

Figure 1: DATA ON SMOKING AND CANCER IN U.S. POPULATION OF AGE GREATER
THAN 60.

In general, we can refer to the number of individuals with the ith value of attribute A, the jth
value of attribute B, and the kth value of attribute C as

N^{ABC}_{ijk} or N_{ijk}

where i, j, k are the numbers associated with the values of the attributes.

We will assume that the range of values for each attribute is complete (made so by adding
the value “other,” if necessary) so that the number of people obtained by summing the
numbers for each of the values of an attribute (e.g., B (CANCER)) will always add up to N
(the total number of individuals surveyed).

If we add up the numbers in each row or column of Figures 1a and 1b, we obtain the marginal
values (placed in the margins) as shown in Figures 2a and 2b. If we sum across C (FAMILY
HISTORY OF CANCER), we obtain Figure 2c, which relates SMOKING to CANCER without
breaking up the results with respect to family history.
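The marginal sums just described can be sketched as follows. This is a minimal illustration using the Figure 1 counts; the function name `marginal` and the axis numbering (0 = A, 1 = B, 2 = C) are assumptions made for the sketch, not part of the memorandum.

```python
from collections import defaultdict

# Cell counts N[(i, j, k)] from Figure 1 (i = smoking, j = cancer, k = family history).
COUNTS = {
    (1, 1, 1): 130, (1, 2, 1): 410, (1, 1, 2): 110, (1, 2, 2): 640,
    (2, 1, 1): 62,  (2, 2, 1): 580, (2, 1, 2): 31,  (2, 2, 2): 460,
    (3, 1, 1): 78,  (3, 2, 1): 520, (3, 1, 2): 22,  (3, 2, 2): 385,
}


def marginal(counts, keep):
    """Sum N_ijk over every attribute whose axis is not listed in `keep`."""
    out = defaultdict(int)
    for cell, n in counts.items():
        out[tuple(cell[axis] for axis in keep)] += n
    return dict(out)


N_ij = marginal(COUNTS, (0, 1))   # summed over C: smoking versus cancer
N_ik = marginal(COUNTS, (0, 2))   # summed over B: smoking versus family history
N_i = marginal(COUNTS, (0,))      # first-order totals for smoking history
N = sum(COUNTS.values())          # total number of individuals surveyed

print(N_ij[(1, 1)], N_ik[(1, 2)], N_i[(1,)], N)
```

Because every marginal is derived from the same cell counts, each first-order marginal automatically sums to N, which is the completeness assumption stated above.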

In equation form, these summations can be simply written as

    N_{ij} = \sum_k N_{ijk}    (1)

    N_{jk} = \sum_i N_{ijk}    (2)

    N_{ik} = \sum_j N_{ijk}    (3)

and

    N_i = \sum_j N_{ij} = \sum_j \sum_k N_{ijk}    (4)

    N_i = \sum_k N_{ik} = \sum_k \sum_j N_{ijk}    (5)

Similarly, the total number of individuals, N, is simply the summation across all the indices

    N = \sum_i \sum_j \sum_k N_{ijk}    (6)

[Figure 2 panels: (a) C FAMILY HISTORY OF CANCER, 1. Yes; (b) C FAMILY HISTORY OF CANCER, 2. No]

Figure 2: CANCER DATA ON U.S. POPULATION OF AGE GREATER THAN 60.
APPROACH


The approach taken for finding joint probabilities of attributes is to maximize the entropy of
the discrete probability distribution while satisfying the constraints imposed by all the known
probabilities. This can also be thought of as achieving the maximum uncertainty in the values
of these higher-order probabilities, as any lesser uncertainty would imply further constraints.
“Maximum entropy” probability values distribute the uncertainty (H) as evenly as possible
over the underlying probability space in a way consistent with the constraints.

The “known probabilities” (constraints) are determined by applying a significance test to
the data. A “minimum message length” criterion is used as the test for significance between
the observed values of occurrence of higher-order combinations of attributes in the data and the
values calculated by the maximum entropy approach using the currently known constraints.
If the minimum message length required to encode an observed value of occurrence assuming
a chance distribution is less than that given by the maximum entropy approach (using the
constraints thus far), then the observed value is deemed significant and the joint probability
associated with it is added to the list of constraints. Once all the significant joint probabilities
are determined, any other probability relationships associated with the data can be readily
calculated from the resulting succinct equation.

MAXIMIZING THE ENTROPY 

The entropy (uncertainty) is given in terms of the joint probabilities as (Jaynes, 1979)

    H = -\sum_{ijk\ldots} p_{ijk\ldots} \log p_{ijk\ldots}    (7)

where i, j, k, ... are the indices of the values of the attributes and p_{ijk} is the joint probability
(probability of the simultaneous occurrence of the ith value of the attribute A, the jth value of
the attribute B, and the kth value of attribute C).

For simplicity, in the rest of this paper, we will only consider three attributes: A, B, and C.
The extension to a larger number of attributes is straightforward.
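As an illustration of Equation 7, the sketch below compares the entropy of the observed cell frequencies in the smoking example with the entropy of the distribution that keeps only the first-order probabilities and treats the attributes as independent. The names `entropy` and `first_order` are hypothetical, and the natural logarithm is an assumed choice of base.

```python
import math
from itertools import product

# Cell counts N[(i, j, k)] from Figure 1 (i = smoking, j = cancer, k = family history).
COUNTS = {
    (1, 1, 1): 130, (1, 2, 1): 410, (1, 1, 2): 110, (1, 2, 2): 640,
    (2, 1, 1): 62,  (2, 2, 1): 580, (2, 1, 2): 31,  (2, 2, 2): 460,
    (3, 1, 1): 78,  (3, 2, 1): 520, (3, 1, 2): 22,  (3, 2, 2): 385,
}
N = sum(COUNTS.values())


def entropy(dist):
    """H = -sum p log p over the cells with nonzero probability (Equation 7)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def first_order(axis):
    """First-order probabilities of one attribute (axis 0 = A, 1 = B, 2 = C)."""
    totals = {}
    for cell, n in COUNTS.items():
        totals[cell[axis]] = totals.get(cell[axis], 0) + n
    return {value: n / N for value, n in totals.items()}


p_A, p_B, p_C = first_order(0), first_order(1), first_order(2)
observed = {cell: n / N for cell, n in COUNTS.items()}
independent = {(i, j, k): p_A[i] * p_B[j] * p_C[k]
               for i, j, k in product(p_A, p_B, p_C)}

H_observed = entropy(observed)
H_independent = entropy(independent)
print(H_observed, H_independent)
```

Both distributions share the same first-order probabilities, so the independent one, being the maximum entropy choice under those constraints alone, cannot have the smaller entropy.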

To maximize the entropy we first add to H the constraints (associated with the prior known
probabilities) using Lagrange multipliers (w’s) to form H'. We then take the derivative of H'
with respect to each of the unknown variables (the probabilities and the Lagrange multipliers)
and set them equal to zero to find values for the variables that maximize H'. Thus

    H' = -\sum_{ijk} p_{ijk} \log p_{ijk}
         + w (1 - \sum_{ijk} p_{ijk})
         + w_i (p_i - \sum_{jk} p_{ijk}) + w_j (p_j - \sum_{ik} p_{ijk}) + w_k (p_k - \sum_{ij} p_{ijk})
         + w_{ij} (p_{ij} - \sum_k p_{ijk}) + w_{ik} (p_{ik} - \sum_j p_{ijk}) + w_{jk} (p_{jk} - \sum_i p_{ijk})
         + \cdots    (8)

Taking derivatives with respect to the probabilities, we obtain

    \partial H' / \partial p_{ijk} = -\log p_{ijk} - 1 - w - w_i - w_j - \cdots - w_{ij} - \cdots = 0    (9)

Therefore

    p_{ijk} = e^{-(w_0 + w_i + w_j + \cdots + w_{ij} + \cdots)}    (10)

where we have defined

    w_0 = w + 1    (11)

From Equation 10

    p_{ijk} = a_0 a_i a_j \cdots a_{ij} \cdots    (12)

where we have defined

    a_i = e^{-w_i}    (13)


Taking the partial derivatives of H' with respect to the w multipliers and setting them equal
to zero simply returns our given constraints

    \partial H' / \partial w = 0 \rightarrow \sum_{ijk} p_{ijk} = 1    (14)

    \partial H' / \partial w_i = 0 \rightarrow \sum_{jk} p_{ijk} = p_i    (15)

    \partial H' / \partial w_j = 0 \rightarrow \sum_{ik} p_{ijk} = p_j    (16)

    \partial H' / \partial w_k = 0 \rightarrow \sum_{ij} p_{ijk} = p_k    (17)

    \partial H' / \partial w_{ij} = 0 \rightarrow \sum_k p_{ijk} = p_{ij}    (18)

    \partial H' / \partial w_{ik} = 0 \rightarrow \sum_j p_{ijk} = p_{ik}    (19)

    \partial H' / \partial w_{jk} = 0 \rightarrow \sum_i p_{ijk} = p_{jk}    (20)

(and so on for any higher-order known prior probabilities).

Substituting for p_{ijk} from Equation 12 into Equations 14 - 20 yields

    a_0 \sum_{ijk} a_i a_j \cdots a_{ij} a_{ik} \cdots = 1    (21)

    a_0 a_i \sum_{jk} a_j a_k a_{ij} a_{ik} \cdots = p_i    (22)

    a_0 a_j \sum_{ik} a_i a_k a_{ij} a_{ik} \cdots = p_j    (23)

    a_0 a_k \sum_{ij} a_i a_j a_{ij} a_{ik} \cdots = p_k    (24)

These summations are simplified if we group the summations by the indices². Thus, for example

    a_0 \sum_i a_i \sum_j a_j a_{ij} \sum_k a_k a_{ik} a_{jk} = 1    (25)

    a_0 a_i \sum_j a_j a_{ij} \sum_k a_k a_{ik} a_{jk} = p_i    (26)

    a_0 a_j \sum_i a_i a_{ij} \sum_k a_k a_{ik} a_{jk} = p_j    (27)

    a_0 a_k \sum_i a_i a_{ik} \sum_j a_j a_{ij} a_{jk} = p_k    (28)

    a_0 a_{ij} a_i a_j \sum_k a_k a_{ik} a_{jk} = p_{ij}    (29)

    a_0 a_{ik} a_i a_k \sum_j a_j a_{ij} a_{jk} = p_{ik}    (30)

    a_0 a_{jk} a_j a_k \sum_i a_i a_{ij} a_{ik} = p_{jk}    (31)


As will be illustrated later using our example, this set of simultaneous equations is iteratively 
solved for a values. Initially, the a values are calculated from the first-order probabilities 
derived from the data. Then the a values are recalculated using any known prior second-order 
probabilities. Based on the resulting a values, predicted second-order joint probabilities of the 
attributes are then calculated (using Equation 12 or equivalently Equations 25 - 31) and the 
observed data is evaluated to see whether it differs significantly from the values predicted. 

If the predicted probabilities of the observed values of the combinations of attributes are 
less than the probabilities of their occurrence owing to pure chance, then the values observed
from the data are statistically significantly different from those calculated from the constraints 
used thus far. In this case, these significantly different observed values are used to form new 
constraints and the a values are recalculated. This process is repeated at this level and each 
successive level until all the observed statistically significant correlations are accounted for. 

² A method for calculating such “sum of products” equations is given by Appendix B.

PREDICTING THE VALUE OF N_{ijk}

The probability of finding the number of occurrences having the ith value of attribute A,
the jth value of B, and the kth value of C is given by the well-established “binomial distribution”

    p(N_{ijk} | p_{ijk}, N) = \binom{N}{N_{ijk}} (p_{ijk})^{N_{ijk}} (1 - p_{ijk})^{N - N_{ijk}}    (32)

where

N is the total number sampled

p_{ijk} is the prior probability (calculated from the a values) of the ith value of A, the jth value
of B, and the kth value of C occurring together in the population being sampled.

The predicted mean of N_{ijk} is given by

    \langle N_{ijk} \rangle = N p_{ijk}    (33)

and the associated standard deviation of N_{ijk} is given by

    (N_{ijk})_{sd} = \sqrt{N p_{ijk} (1 - p_{ijk})}    (34)

The mean and standard deviation are useful for estimating the significance of the difference
between the observed value of N_{ijk} and the predicted value.
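A numerical sketch of the predicted mean and standard deviation follows. It assumes, purely for illustration, that only the first-order constraints are in force, so that the predicted probability of a second-order cell is the product of the two first-order probabilities (the memorandum derives this case later). The cell chosen, smokers with no family history of cancer, and the variable names are assumptions of the sketch.

```python
import math

N = 3428        # total number of individuals in Figure 1
N_A1 = 1290     # smokers (A = 1): 130 + 410 + 110 + 640
N_C2 = 1648     # no family history of cancer (C = 2)
N_obs = 750     # observed smokers with no family history: 110 + 640

# Predicted cell probability when A and C are treated as independent.
p = (N_A1 / N) * (N_C2 / N)

mean = N * p                        # predicted mean, N p
sd = math.sqrt(N * p * (1 - p))     # standard deviation, sqrt(N p (1 - p))
z = (N_obs - mean) / sd             # observed deviation in units of sd

print(round(mean, 1), round(sd, 1), round(z, 2))
```

For these counts the mean is roughly 620 with a standard deviation of about 22.5, so the observed 750 lies more than five standard deviations above the prediction.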


SIGNIFICANCE TESTING OF THE OBSERVED VALUES
OF THE N_{ijk}’s

In this section we determine the significance of the observed values of the N’s by comparing
the probability of their chance occurrence with the probability predicted by the probability
formula, Equation 12, derived from the constraints found thus far. If, for example, the
probability of occurrence by chance of an observed value of N_{ijk} is greater than that predicted by the
formula, we regard that N_{ijk} as significant and use it as a constraint to revise our probability
formula so that it will predict the observed value. The revised formula is then used to predict
the probability of occurrence of the remaining observed values of the N’s and the procedure is
recursively repeated until no further significant N’s are found.

The procedure outlined above starts by comparing the probabilities of the chance occurrence
of the observed values of the second-order N’s with the probabilities predicted by Equation 12
and uses the resultant most significant N to update Equation 12. The remaining second-order
N’s are then evaluated using the new predictions and the procedure is recursively repeated until
all the significant second-order N’s have been determined. This procedure is then repeated for
the third-order N’s and so on.

In determining the probability of an observed value of N_{ijk} occurring by chance, we note
that the values in the cells must add up to their constraining marginal values. Thus no cell
can have a value exceeding its significant marginals minus the values of any other significant
cells associated with those marginals. This sets a maximum value for a cell. If for a cell all the
other cells associated with one of its significant marginals have been found to be significant,
then the cell must have the value observed for it. If the cell is not so constrained, then for the
chance case it is equally likely that it will have any integer value from zero to its maximum
value (discussed above).

We now derive the equation needed for comparing the probabilities of occurrence of the
observed value of N_{ijk} as calculated by Equation 12 and as calculated for chance.
The well-known Bayes formula for calculating the posterior probability of a hypothesis, h1,
given data, D, is

    p(h1 | D) = p(h1) p(D | h1) / p(D)    (35)

where p(h1) is the prior probability of the hypothesis. For our purposes, a more convenient
relative form of Bayes’ rule gives the likelihood ratio of the posterior probabilities of two different
hypotheses (given the same data)

    p(h1 | D) / p(h2 | D) = [p(h1) p(D | h1)] / [p(h2) p(D | h2)]    (36)

Taking the log of the likelihood ratio, we obtain

    \ln [p(h1 | D) / p(h2 | D)] = [-\ln p(h2) - \ln p(D | h2)] - [-\ln p(h1) - \ln p(D | h1)]    (37)

In information theory, the minimum message length required to encode (communicate) a
particular choice (e.g., h1) from a set of mutually exclusive and exhaustive hypotheses is
proportional to -\ln p(h1) (Jaynes, 1979). Thus, Equation 37 can be interpreted as proportional
to the difference in the minimum message lengths required to represent the two hypotheses
given the data.

There are two basic hypotheses for the N_{ijk} obtained from the data for a particular cell ijk.

H1 Given that we have found M nth-order significant constraints (joint probabilities), there
are no more nth-order significant constraints (i.e., Equation 12 adequately predicts the
probability of occurrence of the observed values of the remaining nth-order N’s).

H2 Given that we have found M nth-order significant constraints (joint probabilities), there
is at least one more nth-order significant constraint and this cell is the next nth-order
significant constraint.

The hypothesis H2 can be broken up into two hypotheses

H2' There is at least one more nth-order significant constraint.

H2'' This cell is the next nth-order significant constraint.

Thus

    p(H2) = p(H2' H2'') = p(H2') p(H2'' | H2')    (38)

where without any other information

    p(H2'' | H2') = 1 / (remaining available cells at the current order)
                  = 1 / (no. of cells at this order - M)    (39)

For the highest order this is simply

    p(H2'' | H2') = 1 / (IJK - M)    (40)

where I, J, K are the total number of values of the A, B, and C attributes, respectively.

For hypothesis H2, lacking prior knowledge, the value of N_{ijk} is equally likely to be any
integer in the range of values available to it. Thus, for third-order combinations

IF

    \min \{ [JK - No_i(N^{ABC})], [IK - No_j(N^{ABC})], [IJ - No_k(N^{ABC})],
            [J - No_i(N^{AB})], [K - No_i(N^{AC})],
            [I - No_j(N^{AB})], [K - No_j(N^{BC})],
            [I - No_k(N^{AC})], [J - No_k(N^{BC})] \} > 1

THEN

    p(D | H2) = p(N_{ijk} | H2)
              = p(N_{ijk} | N_i, N_j, N_k, I, J, K, significant(N^{ABC}s), H2)
              = 1 / ( \min \{ [N_i^A - \sum_{yz \neq jk} significant(N_{iyz}^{ABC}s)],
                              [N_j^B - \sum_{xz \neq ik} significant(N_{xjz}^{ABC}s)],
                              [N_k^C - \sum_{xy \neq ij} significant(N_{xyk}^{ABC}s)],
                              [significant N_{ij}^{AB} - \sum_{z \neq k} significant(N_{ijz}^{ABC}s)],
                              [significant N_{ik}^{AC} - \sum_{y \neq j} significant(N_{iyk}^{ABC}s)],
                              [significant N_{jk}^{BC} - \sum_{x \neq i} significant(N_{xjk}^{ABC}s)] \} + 1 )

ELSE

    p(D | H2) = 1    (41)

(as the value of N_{ijk} is then completely determined from the marginal values and the significant
values previously found)
where we have defined

    No_i(N_{iyz}s) = number of N_{iyz}’s found significant that have the ith value of the A attribute    (42)

etc.

Note that in Equation 41 our constraints are the first-order marginals (N_i^A, N_j^B, N_k^C)
and any higher-order marginals found significant in our analysis or originally given as significant.

Using Equation 38 in Equation 37, the hypothesis H1 (that there are no more significant
nth-order constraints so that the prior probability calculated from the a’s is adequate) is more
likely than H2 if

    [-\ln p(H2') - \ln p(H2'' | H2') - \ln p(D | H2)] - [-\ln p(H1) - \ln p(D | H1)] > 0    (43)

or in abbreviated form as

    m2 - m1 > 0    (44)

where, using Equations 39 and 41 in Equation 43,

IF

    \min \{ [JK - No_i(N^{ABC})], [IK - No_j(N^{ABC})], [IJ - No_k(N^{ABC})],
            [J - No_i(N^{AB})], [K - No_i(N^{AC})],
            [I - No_j(N^{AB})], [K - No_j(N^{BC})],
            [I - No_k(N^{AC})], [J - No_k(N^{BC})] \} > 1

THEN

    m2 = -\ln p(H2') + \ln (no. of cells at this order - M)
         + \ln \{ \min \{ [N_i^A - \sum_{yz \neq jk} significant(N_{iyz}^{ABC}s)],
                          [N_j^B - \sum_{xz \neq ik} significant(N_{xjz}^{ABC}s)],
                          [N_k^C - \sum_{xy \neq ij} significant(N_{xyk}^{ABC}s)],
                          [significant N_{ij}^{AB} - \sum_{z \neq k} significant(N_{ijz}^{ABC}s)],
                          [significant N_{ik}^{AC} - \sum_{y \neq j} significant(N_{iyk}^{ABC}s)],
                          [significant N_{jk}^{BC} - \sum_{x \neq i} significant(N_{xjk}^{ABC}s)] \} + 1 \}    (45)

ELSE

    m2 = -\ln p(H2') + \ln (no. of cells at this order - M)

Similarly, using Equation 32 in Equation 43

    m1 = -\ln p(H1) - [ \ln \binom{N}{N_{ijk}} + N_{ijk} \ln p_{ijk} + (N - N_{ijk}) \ln (1 - p_{ijk}) ]    (46)


For the observed value of N_{ijk} to be statistically significant requires (from Equation 44)
that

    m2 - m1 < 0    (47)

If the observed N_{ijk} is statistically significant, then it forms a new constraint and a new set
of a values is calculated that will predict it when using these new values in Equation 12.
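The significance test can be sketched for a single second-order cell as follows. This is an illustration under stated assumptions rather than the memorandum's program: the cell is assumed unconstrained (the THEN branch of Equation 45), the largest value available to it is taken to be the smaller of its two first-order marginals because no cells have yet been found significant, p(H2') is taken as 1 - p(H1), and the function names are hypothetical.

```python
import math


def log_binomial_pmf(n_obs, n_total, p):
    """Natural log of the binomial probability of Equation 32."""
    return (math.lgamma(n_total + 1) - math.lgamma(n_obs + 1)
            - math.lgamma(n_total - n_obs + 1)
            + n_obs * math.log(p) + (n_total - n_obs) * math.log(1 - p))


def message_length_difference(n_obs, n_total, p_pred, n_cells, n_found,
                              max_value, p_h1=0.5):
    """Return m2 - m1; a negative value marks the cell as significant.

    max_value is the largest count the cell could take (the min{...} term),
    n_cells the number of cells at this order, n_found the constraints (M)
    already accepted at this order.
    """
    m2 = (-math.log(1.0 - p_h1) + math.log(n_cells - n_found)
          + math.log(max_value + 1))
    m1 = -math.log(p_h1) - log_binomial_pmf(n_obs, n_total, p_pred)
    return m2 - m1


N = 3428
# Smokers with no family history of cancer: observed 750.
d_12 = message_length_difference(750, N, (1290 / N) * (1648 / N),
                                 n_cells=16, n_found=0, max_value=1290)
# Non smokers with a family history of cancer: observed 642.
d_21 = message_length_difference(642, N, (1133 / N) * (1780 / N),
                                 n_cells=16, n_found=0, max_value=1133)
print(round(d_12, 2), round(d_21, 2))
```

With equal priors the first cell gives a clearly negative difference (significant) and the second a positive one (not significant), which is the sign convention of Equations 44 and 47.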
CALCULATING THE INITIAL a VALUES BASED ON THE
FIRST ORDER PROBABILITIES


The first order probabilities are readily calculated from the data as

    p_i^A = N_i^A / N

    p_j^B = N_j^B / N    (48)

    p_k^C = N_k^C / N

If we start with these as our only initial constraints, then from Equations 25 - 28, we obtain
for the data from our example

    a_0 a^A a^B a^C = 1    (49)

    a_0 a_1^A a^B a^C = p_1^A = .38    (50)

    a_0 a_2^A a^B a^C = p_2^A = .33    (51)

    a_0 a_3^A a^B a^C = p_3^A = .29    (52)

    a_0 a_1^B a^A a^C = p_1^B = .13    (53)

    a_0 a_2^B a^A a^C = p_2^B = .87    (54)

    a_0 a_1^C a^A a^B = p_1^C = .52    (55)

    a_0 a_2^C a^A a^B = p_2^C = .48    (56)

where we have defined

    a^A = a_1^A + a_2^A + a_3^A    (57)

    a^B = a_1^B + a_2^B    (58)

    a^C = a_1^C + a_2^C    (59)

It can readily be verified that the solution to Equations 49 - 56 is

    a_0 = 1
    a^A = 1      a^B = 1      a^C = 1
    a_1^A = .38  a_2^A = .33  a_3^A = .29    (60)
    a_1^B = .13  a_2^B = .87
    a_1^C = .52  a_2^C = .48

which simply means that for this simple case where there are no constraining higher-order 
probabilities, the a values are just the values of the associated first-order probabilities. 

Substituting these values into Equations 29 - 31, we find that the higher-order probabilities 
are just equal to the product of the corresponding first-order probabilities. This indicates that 
the maximum entropy approach has distributed the higher-order probabilities based on the 
attributes being statistically independent — as we would expect, having no other information. 
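The independence result can be checked numerically. The sketch below sets a_0 = 1 and each first-order a value equal to its first-order probability (computed from the Figure 1 totals rather than from the rounded figures), forms the product of Equation 12, and sums it back to second order; the variable names are assumptions of the sketch.

```python
from itertools import product

N = 3428
# First-order totals from Figure 1.
N_A = {1: 1290, 2: 1133, 3: 1005}   # smoking history
N_B = {1: 433, 2: 2995}             # cancer
N_C = {1: 1780, 2: 1648}            # family history of cancer

# With no higher-order constraints, a_0 = 1 and each a equals its probability.
a0 = 1.0
a_A = {i: n / N for i, n in N_A.items()}
a_B = {j: n / N for j, n in N_B.items()}
a_C = {k: n / N for k, n in N_C.items()}

# Product form of the joint probability for every cell.
p_ijk = {(i, j, k): a0 * a_A[i] * a_B[j] * a_C[k]
         for i, j, k in product(a_A, a_B, a_C)}

# Summing over B gives the predicted second-order probabilities for A and C.
p_ik = {(i, k): sum(p_ijk[(i, j, k)] for j in a_B)
        for i, k in product(a_A, a_C)}

for (i, k), value in sorted(p_ik.items()):
    print(i, k, round(value, 3))
```

Each predicted second-order probability equals the product of the two first-order probabilities, as stated above.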

Thus from Equation 12

    p_{ijk}^{ABC} = p_i^A p_j^B p_k^C    (61)

and

    p_{ij}^{AB} = p_i^A p_j^B \sum_k p_k^C = p_i^A p_j^B    (62)

Lacking other information, we will assume that the probability of there being one more 
constraint is equal to the probability that there are no more constraints. (If prior information 
is available about the possibility of remaining constraints, then this is easily incorporated.) 
Assuming equality, 

    p(H2') = p(H1)    (63)

resulting in these terms cancelling in Equation 43, simplifying the calculations for (m2 - m1).
Note (using Equations 45 and 46 for m2 and m1) that


If p(H2') = .6 so that p(H1) = .4, this makes a difference of -.40 in (m2 - m1).

If p(H2') = .8 so that p(H1) = .2, this makes a difference of -1.39 in (m2 - m1).

For our example, Table 1 gives the values of the predicted second-order probabilities
(calculated from conditional independence), and values for the observed N_{ij}, N_{ik}, N_{jk} and their
predicted mean and standard deviation. Also given are the values of (m2 - m1), indicative
of the statistical significance of the observed values, and the resulting likelihood ratio of the
two hypotheses. For our example, there are 16 second order cells from which to choose the
significant cell. (Note that even p(H2') = .8 only changes the sign of (m2 - m1) for one of the
values in our example.)

    Predicted p_{ik}^{AC} = p_i^A p_k^C   Observed      Predicted   sd     No. of   m2 - m1   Likelihood
                                          N_{ik}^{AC}   mean               sd’s               ratio
    p_11 = .376 x .519 = .195             540           668         23.2   -5.54              <.1
    p_12 = .376 x .481 = .181             750           620         22.5    5.75    -9.95     <.1
    p_21 = .331 x .519 = .172             642           590         22.1    2.37     2.87     17.6
    p_22 = .331 x .481 = .159             491           545         21.4   -2.52     2.63     13.9
    p_31 = .293 x .519 = .152             598           593         22.1     .22     -.64       .5
    p_32 = .293 x .481 = .141             407           483         20.4   -3.75    -1.49       .2

Table 1: CALCULATED PARAMETER VALUES USEFUL FOR DETERMINING
STATISTICALLY SIGNIFICANT SECOND ORDER ATTRIBUTE DATA

CALCULATING a VALUES FOR HIGHER-ORDER CONSTRAINTS

If we select N_{12}^{AC}, from Table 1, as the first statistically significant data value to investigate,
then (by including the associated a_{12}^{AC}) we obtain from Equations 25 - 28 and 30 the following
equations for finding the new values of the a’s

    c [a_1^A a_1^C + b + (a_2^A + a_3^A) a^C] = 1    (64)

    c [a_1^A a_1^C + b] = p_1^A = .38    (65)

    c a_2^A a^C = p_2^A = .33    (66)

    c a_3^A a^C = p_3^A = .29    (67)

    a_0 a_1^B [a_1^A a_1^C + b + (a_2^A + a_3^A) a^C] = p_1^B = .13    (68)

    a_0 a_2^B [a_1^A a_1^C + b + (a_2^A + a_3^A) a^C] = p_2^B = .87    (69)

    c a_1^C a^A = p_1^C = .52    (70)

    c [b + a_2^A a_2^C + a_3^A a_2^C] = p_2^C = .48    (71)

    c b = (p_{12}^{AC})_{data} = N_{12}^{AC} / N = .219    (72)

where we have defined

    c = a_0 a^B    (73)

    b = a_1^A a_2^C a_{12}^{AC}    (74)

Observe that if we add Equations 68 and 69 we obtain Equation 64. Thus Equations 68
and 69 do not contribute to the evaluation of the other a’s. This result is to be expected as the
B attribute is not part of our latest constraint.

Incorporating Equations 57 - 59, Equations 64 - 74 can be put in the following order for
iteratively solving for the a values.

    b = .219 / c                                            from Equation 72    (75)

    c = .38 / (a_1^A a_1^C + b)                             from Equation 65    (76)

    a_2^A = .33 / (c a^C)                                   from Equation 66    (77)

    a_3^A = .29 / (c a^C)                                   from Equation 67    (78)

    a_2^C = (.48 / c - b) / (a_2^A + a_3^A)                 from Equation 71    (79)

    a_1^C = .52 / (c a^A)                                   from Equation 70    (80)

    a^C = a_1^C + a_2^C                                     from Equation 59    (81)

    a_1^A = [1 / c - b - (a_2^A + a_3^A) a^C] / a_1^C       from Equation 64    (82)

    a^A = a_1^A + a_2^A + a_3^A                             from Equation 57    (83)

    a_1^B = .13 / (a_0 (a_1^A a_1^C + b + (a_2^A + a_3^A) a^C))   from Equation 68    (84)

    a_2^B = (.87 / .13) a_1^B                               from Equation 69    (85)

    a^B = a_1^B + a_2^B                                     from Equation 58    (86)

    a_0 = c / a^B                                           from Equation 73    (87)

Starting with the initial values given by Equation 60 for the a’s without the N_{12}^{AC} constraint,
Equations 75 - 83 are iteratively solved, in the given order, to obtain a new set of a values that
satisfy the N_{12}^{AC} constraint. Then a_1^B, a_2^B, a^B, and a_0 are calculated from Equations 84 - 87
using these values.

    Iteration   b      c     a_2^A  a_3^A  a_2^C  a_1^C  a^C    a_1^A  a^A    a_1^B  a_2^B  a^B    a_0
    0                  1     .33    .29    .48    .52    1      .38    1      .13    .87    1      1
    1           .219   .91   .36    .32    .45    .55    1      .36    1.04
    2           .24    .85   .39    .34    .44    .59    1.03   .31    1.04
    3           .26    .86   .37    .33    .43    .58    1.01   .34    1.04
    4           .26    .84   .39    .34    .43    .59    1.02   .31    1.04
    5           .26    .85   .38    .33    .43    .59    1.02   .33    1.04   .14    .91    1.05   .81
    6                                                                         .15    .95    1.10   .77
    7                                                                         .15    .95    1.10   .77

Table 2: ITERATIVE CALCULATION OF a VALUES TO SATISFY THE N_{12}^{AC} CONSTRAINT


Table 2 presents the results of the iterations to find the new a values. After convergence, new
values of (m2 - m1) are calculated using the probabilities determined from these latest a values.
The resultant new most significant N is then selected and its associated probability added
as a new constraint. Then, starting with the last previously calculated a values, a new set of a
values is iteratively calculated to satisfy this additional constraint. This procedure is repeated
until all the significant second-order probabilities are accounted for. Starting with the resultant
latest set of a values, the procedure is then repeated for the next higher-order combination of
attributes, etc. The overall procedure for finding significant correlations is outlined by the flow
diagram shown in Figure 3. The procedure for calculating the a values is outlined by the flow
diagram shown in Figure 4.
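The constraint-fitting step can also be illustrated with a different but standard technique, iterative proportional fitting applied directly to the joint probabilities. This is a swapped-in method, not the update order of Equations 75 - 87, and it is not expected to reproduce the intermediate rows of Table 2. Each rescaling multiplies cells by a factor that depends only on the constrained indices, so the fitted distribution keeps the product form of Equation 12. The function name `fit` and the sweep count are assumptions of the sketch.

```python
from itertools import product

N = 3428
N_A = {1: 1290, 2: 1133, 3: 1005}   # first-order totals from Figure 1
N_B = {1: 433, 2: 2995}
N_C = {1: 1780, 2: 1648}
CELLS = list(product(N_A, N_B, N_C))

# Each constraint is (membership test, target probability).
constraints = []
for axis, totals in enumerate((N_A, N_B, N_C)):
    for value, count in totals.items():
        constraints.append((lambda c, a=axis, v=value: c[a] == v, count / N))
# Second-order constraint: smokers (A = 1) with no family history (C = 2).
constraints.append((lambda c: c[0] == 1 and c[2] == 2, 750 / N))


def fit(constraints, cells, sweeps=500):
    """Rescale the distribution until every constrained sum hits its target."""
    p = {c: 1.0 / len(cells) for c in cells}
    for _ in range(sweeps):
        for member, target in constraints:
            mass = sum(p[c] for c in cells if member(c))
            for c in cells:
                p[c] *= target / mass if member(c) else (1 - target) / (1 - mass)
    return p


p = fit(constraints, CELLS)
worst = max(abs(sum(p[c] for c in CELLS if member(c)) - target)
            for member, target in constraints)
print(worst)
```

Since B takes no part in the added constraint, the fitted joint probability factors into the A-C probability times the first-order probability of B, in line with the observation that Equations 68 and 69 do not affect the other a values.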
[Flow diagram: Current set of a values -> Add to the current a’s a new a associated with the
most significant N not yet incorporated (e.g., a_{ij}^{AB} for N_{ij}^{AB}) -> Starting with this augmented
set of current a’s, iteratively compute new values for the a’s using equations such as
Equations 25 - 31 -> Done]

Figure 4: CALCULATING a VALUES
[Table layout: one row per sample (Sample No 1, 2, 3, 4, ...), with columns for the values of
A (1, 2, 3), B (1, 2), and C (1, 2); an X marks the value recorded for each attribute.]

Figure 5: ORIGINAL DATA FORM


[Table layout: one row per sample (Sample No 1, 2, 3, 4, etc.), with one column per ABC triple;
an X marks the triple recorded for that sample. The last row sums each column to give N_{ijk}^{ABC}.]

    ABC     111  121  112  122  211  221  212  222  311  321  312  322
    sum =   130  410  110  640   62  580   31  460   78  520   22  385

Figure 6: SAMPLE DATA IN TRIPLES FORM


APPENDIX A 

CONVERTING ORIGINAL DATA TO CONTINGENCY 
TABLE FORM 


For our sample problem, the original data in response to the questionnaire might be in the
form of Figure 5 (or can readily be placed in that form). 

Put into attribute triples form, this data might appear in terms of ijk values of the attributes 
as shown in Figure 6. Note that the summations of the triples are the values of the cells in 
Figure 1. 
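The conversion described in this appendix amounts to counting identical triples. A minimal sketch follows; the five records are hypothetical questionnaire responses invented for illustration, not data from the survey.

```python
from collections import Counter

# Hypothetical responses, one (A, B, C) triple per respondent:
# A = smoking history (1-3), B = cancer (1 = yes, 2 = no),
# C = family history of cancer (1 = yes, 2 = no).
records = [
    (1, 2, 1),
    (2, 2, 2),
    (1, 1, 1),
    (1, 2, 1),
    (3, 2, 2),
]

# Counting the triples gives the contingency table cells N_ijk.
cell_counts = Counter(records)

print(cell_counts[(1, 2, 1)])   # respondents with A = 1, B = 2, C = 1
print(cell_counts[(3, 1, 1)])   # a triple that never occurs counts as zero
```

A `Counter` returns zero for triples that were never observed, so every cell of the table is defined even when the data are sparse.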
APPENDIX B


CALCULATING SUM OF PRODUCTS EQUATIONS INVOLVING a’s


From Equation 12, we have the basic equation for the nth order probability as:

    p_{ijk\ldots} = a_0 a_i a_j \cdots a_{ij} \cdots    (88)

based upon which the basic equations for the a’s and the p’s are given by Equations 21 - 24.
These equations all involve the summations of products of a’s. If we order these summations,
we obtain equations such as Equation 25

    1 / a_0 = \sum_i a_i \sum_j a_j a_{ij} \sum_k a_k a_{ik} a_{jk}    (89)


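A convenient way to handle such summation products is to introduce matrices.

Let us define the operator X indicating comparable term-by-term matrix multiplication.
Thus, for example,

    \begin{bmatrix} a & b \\ c & d \end{bmatrix} X \begin{bmatrix} e & f \\ g & h \end{bmatrix}
        = \begin{bmatrix} ae & bf \\ cg & dh \end{bmatrix}    (90)

The summation operator, \sum, can be considered to be summing terms in such matrices in
the following manner

    M_j = \sum_i M_{ij} = \sum_i \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
        = \begin{bmatrix} m_{11} + m_{21} & m_{12} + m_{22} \end{bmatrix}    (91)

    M_i = \sum_j M_{ij} = \sum_j \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
        = \begin{bmatrix} m_{11} + m_{12} \\ m_{21} + m_{22} \end{bmatrix}    (92)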

In this notation, Equation 89 can be written as

    1 / a_0 = \sum_i a_i \sum_j a_j a_{ij} \sum_k a_k a_{ik} a_{jk}
            = \sum_i ( Q_i X \sum_j ( Q_{ij} X \sum_k Q_{ijk} ) )    (93)

In Equation 93, for

    I = 3, J = 2, and K = 2    (94)

(corresponding to the number of values of the attributes A, B, and C in our example)

    \sum_k Q_{ijk} = S_{ij} = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \\ s_{31} & s_{32} \end{bmatrix}    (95)

where

    s_{ij} = a_1^C a_{i1}^{AC} a_{j1}^{BC} + a_2^C a_{i2}^{AC} a_{j2}^{BC}    (96)

and

    Q_{ij} = \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \\ q_{31} & q_{32} \end{bmatrix}    (97)

where

    q_{ij} = a_j^B a_{ij}^{AB}    (98)

and

    Q_i = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \end{bmatrix}    (99)

where

    q_i = a_i^A    (100)

Using this notation, Equation 93 can be written as

    1 / a_0 = \sum_i a_i \sum_j a_j a_{ij} \sum_k a_k a_{ik} a_{jk}
            = \sum_i ( Q_i X \sum_j ( Q_{ij} X \sum_k Q_{ijk} ) )
            = \sum_i ( Q_i X \sum_j ( Q_{ij} X S_{ij} ) )
            = \sum_i ( Q_i X S_i )
            = S    (101)

where

    S_i = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}    (102)

    s_i = q_{i1} s_{i1} + q_{i2} s_{i2}    (103)

and

    S = q_1 s_1 + q_2 s_2 + q_3 s_3    (104)

Observe the recursive nature of these equations by noting that in general

    S_n = \sum_{n+1 index} ( Q_{n+1} X S_{n+1} )    (105)

where n is the order of the matrix and the nth index is the index of the nth attribute.

For R being the highest order of the attributes (R = 3 in our example),

    S_R = [1], a unity matrix in our notation,    (106)

so that

    Q_R X S_R = Q_R    (107)
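The grouped summation can be sketched with plain dictionaries in place of the matrix notation. The function `nested_sum` evaluates the sum innermost index first, and `brute_force_sum` evaluates the same quantity cell by cell as a check. Both names are hypothetical, and the value 1.5 given to the a value for A = 1, C = 2 is an arbitrary illustration, not a result from the memorandum.

```python
from itertools import product


def nested_sum(aA, aB, aC, aAB, aAC, aBC):
    """1/a_0 = sum_i a_i sum_j a_j a_ij sum_k a_k a_ik a_jk, innermost sum first."""
    total = 0.0
    for i, a_i in aA.items():
        s_i = 0.0
        for j, a_j in aB.items():
            # Innermost sum over k, corresponding to one entry s_ij.
            s_ij = sum(a_k * aAC[i, k] * aBC[j, k] for k, a_k in aC.items())
            s_i += a_j * aAB[i, j] * s_ij
        total += a_i * s_i
    return total


def brute_force_sum(aA, aB, aC, aAB, aAC, aBC):
    """The same quantity computed as one flat sum over all cells."""
    return sum(aA[i] * aB[j] * aC[k] * aAB[i, j] * aAC[i, k] * aBC[j, k]
               for i, j, k in product(aA, aB, aC))


# First-order a values from Equation 60; second-order a values default to 1.
aA = {1: 0.38, 2: 0.33, 3: 0.29}
aB = {1: 0.13, 2: 0.87}
aC = {1: 0.52, 2: 0.48}
aAB = {(i, j): 1.0 for i, j in product(aA, aB)}
aAC = {(i, k): 1.0 for i, k in product(aA, aC)}
aBC = {(j, k): 1.0 for j, k in product(aB, aC)}

baseline = nested_sum(aA, aB, aC, aAB, aAC, aBC)   # no higher-order constraints

aAC[1, 2] = 1.5   # arbitrary illustrative value for one second-order a
grouped = nested_sum(aA, aB, aC, aAB, aAC, aBC)
flat = brute_force_sum(aA, aB, aC, aAB, aAC, aBC)
print(baseline, grouped, flat)
```

Grouping reuses each inner sum instead of recomputing it for every cell, which is the saving the recursive form above expresses.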

If, using Equation 88, we desire to calculate a probability, p_k, then Equation 88 takes the
form of Equation 28

    a_0 a_k \sum_i a_i a_{ik} \sum_j a_j a_{ij} a_{jk} = p_k    (108)

or in our matrix notation

    a_0 Q_k X \sum_i ( Q_{ik} X \sum_j Q_{ijk} ) = P_k = \begin{bmatrix} p_1^C & p_2^C \end{bmatrix}    (109)

where

    q_{ijk} = a_j a_{ij} a_{jk}    (110)

    q_{ik} = a_i a_{ik}    (111)

    q_k = a_k    (112)

    P_k = p_k    (113)

Thus, for our example

    Q_{ik} = \begin{bmatrix} a_1^A a_{11}^{AC} & a_1^A a_{12}^{AC} \\
                             a_2^A a_{21}^{AC} & a_2^A a_{22}^{AC} \\
                             a_3^A a_{31}^{AC} & a_3^A a_{32}^{AC} \end{bmatrix}    (114)

and

    S_{ik} = \begin{bmatrix}
        a_1^B a_{11}^{AB} a_{11}^{BC} + a_2^B a_{12}^{AB} a_{21}^{BC} & a_1^B a_{11}^{AB} a_{12}^{BC} + a_2^B a_{12}^{AB} a_{22}^{BC} \\
        a_1^B a_{21}^{AB} a_{11}^{BC} + a_2^B a_{22}^{AB} a_{21}^{BC} & a_1^B a_{21}^{AB} a_{12}^{BC} + a_2^B a_{22}^{AB} a_{22}^{BC} \\
        a_1^B a_{31}^{AB} a_{11}^{BC} + a_2^B a_{32}^{AB} a_{21}^{BC} & a_1^B a_{31}^{AB} a_{12}^{BC} + a_2^B a_{32}^{AB} a_{22}^{BC}
    \end{bmatrix}    (115)

If, as an example, none of the w_{ik}’s are significant, then we replace the corresponding a_{ik}’s
with 1’s, as from Equation 13

    a_{ik} = e^{-w_{ik}} = e^{0} = 1    (116)

In this case, Equation 114 reduces to

    Q_{ik} = \begin{bmatrix} a_1^A & a_1^A \\ a_2^A & a_2^A \\ a_3^A & a_3^A \end{bmatrix}    (117)

so that

    S_k = \sum_i ( Q_{ik} X S_{ik} )    (118)

becomes

    S_k = \begin{bmatrix}
        a_1^A (a_1^B a_{11}^{AB} a_{11}^{BC} + a_2^B a_{12}^{AB} a_{21}^{BC})
        + a_2^A (a_1^B a_{21}^{AB} a_{11}^{BC} + a_2^B a_{22}^{AB} a_{21}^{BC})
        + a_3^A (a_1^B a_{31}^{AB} a_{11}^{BC} + a_2^B a_{32}^{AB} a_{21}^{BC}) &
        a_1^A (a_1^B a_{11}^{AB} a_{12}^{BC} + a_2^B a_{12}^{AB} a_{22}^{BC})
        + a_2^A (a_1^B a_{21}^{AB} a_{12}^{BC} + a_2^B a_{22}^{AB} a_{22}^{BC})
        + a_3^A (a_1^B a_{31}^{AB} a_{12}^{BC} + a_2^B a_{32}^{AB} a_{22}^{BC})
    \end{bmatrix}
        = \begin{bmatrix} s_1 & s_2 \end{bmatrix}    (119)

and

    P_k = \begin{bmatrix} p_1^C & p_2^C \end{bmatrix}
        = \begin{bmatrix} a_0 a_1^C s_1 & a_0 a_2^C s_2 \end{bmatrix}    (120)